SURVIVAL ANALYSIS

Biostatistics Support and Research Unit

Germans Trias i Pujol Research Institute and Hospital (IGTP)
Badalona, Spain

April 4, 2025

Summary

1. Introduction

2. Survival function

3. Comparing survival curves

4. Cox regression models

5. Take home messages

Introduction

Introduction

  • Aim: To analyse time-to-event data.

  • Most common events of interest: death, relapse, development of a specific disease, reintervention.

  • Survival time: time from some fixed starting point to occurrence of a given event or to lost to follow up.

  • Follow a subject until the event of interest occurs or is lost to follow-up.

  • At the end of the study we almost never observe the event of interest in all subjects

    • Event observed

    • Event not observed \(\longrightarrow\) Censored observation

Introduction


Representation of survival data

Introduction

Examples

  • Compare the ability of two different treatments to prolong survival in patients with a particular disease.

  • Identify risk/protective factors for time to having a stroke.

  • Study the relationship between diet and the time to the development of a cardiovascular disease in three different dietary groups.

  • Identify factors associated with time to organ rejection after transplantation.

Aims

  • Describe survival as a function of time \(\longrightarrow\) Kaplan-Meier method


  • Compare survival in different patient groups \(\longrightarrow\) Log-rank test


  • Model survival as a function of covariates \(\longrightarrow\) Cox proportional hazards model

Survival function

Survival function

  • Survival time (T): time to an event of interest (e.g, time from diagnosis to death).

  • Survival function: \[S(t) = P (T > t) = P (\mbox{an individual survives longer than t})\]

  • Interpretation

    • Probability of surviving beyond t.

    • Proportion of individuals surviving longer than t.

    • Probability that the event does not occur before t.

Survival function

Some properties

  • It depends on time and it is a non-increasing function.

  • \(S(0) = 1\)

  • It is usually described using a graph (survival curve).

Survival function

Survival data from a sample of \(n\) subjects from the study population should be as follows:

  • \(t_1, \ldots, t_n\): survival times.

  • \(\delta_1, \ldots, \delta_n\): censoring indicator, where

    • \(\delta_i=1\) if \(t_i\) is observed (event)
    • \(\delta_i=0\) if \(t_i\) is censored (no event)

Example dataset: Advanced lung cancer

  • First, install and load the {survival} package into your R session
install.packages("survival")
library(survival)


  • The {survival} package contains a wide range of functions to work with survival data.

Example dataset: Advanced lung cancer

  • We will work with an example dataset from this package:
lung

This dataset contains survival data on 228 patients with advanced lung cancer from the North Central Cancer Treatment Group

head(lung,3)
  inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1    3  306      2  74   1       1       90       100     1175      NA
2    3  455      2  68   1       0       90        90     1225      15
3    3 1010      1  56   1       0       90        90       NA      15

Survival variables

  • time: Survival time in days

  • status: censoring status 1=censored, 2=dead

Example dataset

Note that status is coded in a non-standard way. Typically you will see 1=event, 0=censored. Let’s recode it to avoid confusion:

library(dplyr)
lung <- 
  lung |> 
  mutate(
    status = case_when(
      status == 1 ~ 0,
      status == 2 ~ 1,
      .default = status
    )
  )

Survival function

Surv() function creates survival objects from variables time and event.

surv_obj <- Surv(time = lung$time, event = lung$status)  # creates a survival object
head(surv_obj, 10)   # prints first 10 values   
 [1]  306   455  1010+  210   883  1022+  310   361   218   166 

It results in a value for each subject: the time followed by + if censored


Event coding: Be sure that the event indicator is correctly encoded

Surv() function accepts events coded as TRUE/FALSE, 1/0 or 2/1.

TRUE, 1, 2 will be read as the event. FALSE, 0, 1 will be read as censored observations.

Kaplan-Meier estimator

  • Non-parametric method to estimate the survival function.

  • It is one of the most widely used methods in survival analysis.

  • Results in a step function, where there is a step down each time an event occurs.

Kaplan-Meier method

  • The survfit() function estimates survival curves using Kaplan-Meier (KM) method

  • You must specify a formula with a survival object (left-hand side of the formula) and 1 (right-hand side)

km <- survfit(Surv(time = time, event = status) ~ 1, data = lung)  

Kaplan-Meier method

And we can plot the survival curve with ggsurvplot() function from {survminer} package .

install.packages("survminer")
library(survminer)
ggsurvplot(km) #?ggsurvplot

Kaplan-Meier method

We can customise the survival curve

library(survminer)
ggsurvplot(km, xlab = "Days", legend = "none", censor=FALSE, conf.int=FALSE, risk.table=TRUE)

Kaplan-Meier method

We can get KM survival estimates at specific times using the summary() function on a survfit object

summary(km, time = c(365, 730))     # survival estimates at 1 and 2 years
Call: survfit(formula = Surv(time = time, event = status) ~ 1, data = lung)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  365     65     121    0.409  0.0358       0.3447        0.486
  730     13      38    0.116  0.0283       0.0716        0.187


  • 1-year probability of survival in this study is 41%.

  • Lower and upper bounds of the 95% confidence interval (CI) are also displayed.

Kaplan-Meier method

  • Another quantity of interest in a survival analysis is the median survival time.

  • We can calculate the median survival time by printing the survfit object

km
Call: survfit(formula = Surv(time = time, event = status) ~ 1, data = lung)

       n events median 0.95LCL 0.95UCL
[1,] 228    165    310     285     363

Kaplan-Meier method

  • We can get other quantiles and 95% CI using the quantile() function
quantile(km)
$quantile
 25  50  75 
170 310 550 

$lower
 25  50  75 
145 285 460 

$upper
 25  50  75 
197 363 654 

Comparing survival curves

Comparing survival curves

  • Aim: To compare survival times between two (or more) groups of subjects.

  • Differences can be illustrated by plotting the Kaplan-Meier curve for each group.

  • This is not enough to conclude that the groups do or do not have different survival times.

  • Are these differences significant or just random variations? \(\longrightarrow\) Log-rank test

Log-rank test

  • Is a hypothesis test used to compare survival distributions.

  • It is non-parametric and provides a comparison of the overall survival of the groups.

  • It is appropriate when survival curves do not cross.

  • Test:

    • \(H_0: S_1(t)=S_2(t)\)
    • \(H_1: S_1(t)\neq S_2(t)\)

Log-rank test in R

  • We get the log-rank p-value using the survdiff() function.

  • Do survival functions differ between the sexes?

library(dplyr)
# define factor for sex 
lung <- lung |>
   mutate(sex = factor(sex, levels=1:2, labels = c("male", "female")))
# specify formula
survdiff(Surv(time, status) ~ sex, data = lung)                            
Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung)

             N Observed Expected (O-E)^2/E (O-E)^2/V
sex=male   138      112     91.6      4.55      10.3
sex=female  90       53     73.4      5.68      10.3

 Chisq= 10.3  on 1 degrees of freedom, p= 0.001 

p-value = 0.001 \(\longrightarrow\) significant difference in overall survival according to sex

Comparing survival curves in R

We can plot the survival curves for each group with ggsurvplot() function and add the log-rank test.

ggsurvplot(survfit(Surv(time, status) ~ sex, data = lung),xlab = "Days",censor=FALSE, conf.int=FALSE, risk.table=TRUE, pval=TRUE, pval.method=TRUE)

Comparing survival curves in R


What if we need to…

  • Study the effect of a continuous variable?

  • Analyse more than one variable?

  • Quantify the effect of a possible risk/protective factor?

\(\longrightarrow\) Cox models

Cox regression models

Cox regression models

  • Cox regression models or Cox proportional hazards models examine the effect of several variables on survival.

  • Both quantitative and qualitative variables can be studied.

  • They are a useful tool for identifying risk/prognosis factors.

Examples

  • What is the effect of ethnicity, gender and socioeconomic status on the risk of developing a particular disease?

  • In cancer patients, is the risk of relapse higher in older patients? Can we quantify this increased risk?

Cox regression models

  • Hazard function: \(h(t)\) represents the hazard of experiencing the event of interest at time t.

  • Cox model formulation:

\[h(t | X_1, X_2, \ldots, X_p)=h_0(t) \cdot \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\]


\(X_1, \ldots, X_p\): Variables of interest.

\(h_0(t)\): Baseline hazard function.

\(\beta_1, \ldots, \beta_p\): Regression coefficients of each variable.

Cox regression models

  • \(\beta_1, \ldots, \beta_p\): indicate the magnitude of the effect of the corresponding variable.

  • For \(i=1,\ldots,p\), the \(p\)-values associated with these estimates correspond to

    • \(H_0: \beta_i=0\)
    • \(H_1: \beta_i\neq0\)

and they indicate whether the \(i\)-th variable has a statistically significant effect on survival/hazard of the event.

Hazard ratios

  • The hazard ratio (HR) for the \(i\)-th variable is \(\exp(\beta_i) = e^{\beta_i}\)

    • Quantitative variables: Change in the risk of the event occurring for each unit increase in the variable at any t.

    • Qualitative variables: Change in the risk of the event occurring for each category of the variable with respect to the reference category at any t.

HR interpretation

  • \(\mbox{Hazard ratio}>1 \rightarrow\) The risk of the event occurring increases.

  • \(\mbox{Hazard ratio}<1 \rightarrow\) The risk of the event occurring decreases.

  • \(\mbox{Hazard ratio}=1 \rightarrow\) The risk of the event occurring remains the same.

Cox regression models in R (qualitative variable)

We can fit a Cox model on R with coxph() function to compare the risk of death between different categories of sex:

m1 <- coxph(Surv(time, status) ~ sex, data = lung)
m1
Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)

             coef exp(coef) se(coef)      z       p
sexfemale -0.5310    0.5880   0.1672 -3.176 0.00149

Likelihood ratio test=10.63  on 1 df, p=0.001111
n= 228, number of events= 165 

Cox regression models in R (qualitative variable)

We can visualise the model results in a fancy table:

  • Using tbl_regression() function from {gtsummary} package

  • Specifying exp=TRUE to show Hazard Ratio.

library(gtsummary)
coxph(Surv(time, status) ~ sex, data = lung) |> 
  tbl_regression(exp = TRUE)
Characteristic HR 95% CI p-value
sex


    male
    female 0.59 0.42, 0.82 0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Cox regression models in R (qualitative variable)

coxph(Surv(time, status) ~ sex, data = lung) |> 
  tbl_regression(exp = TRUE)
Characteristic HR 95% CI p-value
sex


    male
    female 0.59 0.42, 0.82 0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
  • Male is the reference category \(\longrightarrow\) HR calculated for females relative to males.

  • p.value=0.001 and 95% CI does not cross 1 \(\longrightarrow\) Sex is associated with death.

  • \(\mbox{HR}_{\mbox{female}} < 1 \longrightarrow\) Females have lower hazard of death than males.

Cox regression models in R (qualitative variable)

coxph(Surv(time, status) ~ sex, data = lung) |> 
  tbl_regression(exp = TRUE)
Characteristic HR 95% CI p-value
sex


    male
    female 0.59 0.42, 0.82 0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
  • HR = 0.59 \(\longrightarrow\) The risk of death was 0.59 times lower for females than for males at any t (or the risk of death was 41% lower for females)

  • \(\mbox{HR}_{\mbox{male}}=\frac{1}{\mbox{HR}_{\mbox{female}}}=\frac{1}{0.59}=1.69 \longrightarrow\) The risk of death is almost 70% higher for men.

Cox regression models in R (quantitative variable)

coxph(Surv(time, status) ~ age, data = lung) |> 
  tbl_regression(exp = TRUE)
Characteristic HR 95% CI p-value
age 1.02 1.00, 1.04 0.042
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
  • p.value=0.042 \(\longrightarrow\) Age is associated with the risk of death.

  • HR > 1 \(\longrightarrow\) If age increases, the risk of death increases.

  • HR = 1.02 \(\longrightarrow\) At any t, for a 1-year increase, the risk of death is 1.02 times the risk in the previous year.

  • HR = 1.02 \(\longrightarrow\) An increase of 1 year increases the hazard of death by 2%.

  • HR for a 5-year increase: \[e^{5 \cdot \beta_{age}} = (e^{\beta_{age}})^5=(\mbox{HR}_{age})^5=1.02^5=1.10\]

Cox regression models in R (more than one variable)

Let’s build a Cox model to study the effect of age, sex and ph.ecog on death:

coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung) |> 
  tbl_regression(exp = TRUE)
Characteristic HR 95% CI p-value
age 1.01 0.99, 1.03 0.2
sex


    male
    female 0.58 0.41, 0.80 <0.001
ph.ecog 1.59 1.27, 1.99 <0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
  • With sex and ph.ecog on the model, age has not a significant effect on death

  • Females have lower hazard of death than males (HR<1)

  • ph.ecog increases the hazard of death (HR>1)

Model validation

  • Cox models assume that the proportional hazards assumption is met for each covariate \(\longrightarrow\) the hazard ratios between groups remain constant over time.

  • The proportional hazards assumption should be tested:

    • Proportional hazards test.
    • Plots of Schoenfeld residuals.
    • Log-log plots

Model validation: Proportional hazards test

  • The proportional hazards assumption in a Cox model can be tested with cox.zph() function.

  • The null hypothesis of proportionality of hazards for each variable in the model is tested against the hypothesis of non-proportionality of hazards.

m2 <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)
test.ph <- cox.zph(m2)
test.ph
        chisq df    p
age     0.188  1 0.66
sex     2.305  1 0.13
ph.ecog 2.054  1 0.15
GLOBAL  4.464  3 0.22

Model validation: Proportional hazards test

m2 <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)
test.ph <- cox.zph(m2)
test.ph
        chisq df    p
age     0.188  1 0.66
sex     2.305  1 0.13
ph.ecog 2.054  1 0.15
GLOBAL  4.464  3 0.22
  • In our model, all p-values > 0.05 \(\longrightarrow\) the null hypothesis is not rejected.

  • Our model meets the proportional hazards assumption.

  • Very sensitive test (specially with high sample sizes) \(\longrightarrow\) take a look at the plots.

Model validation: Graphs of Schoenfeld residual

  • Graphical diagnostics using ggcoxzph() function ({survminer} package)

  • Generates plots of Schoenfeld residuals versus time for each covariate.

ggcoxzph(test.ph)


  • Proportional hazards assumes that estimates do not vary much over time.

  • Deviations from a horizontal line are indicative of non-proportional hazards.

Model validation: Graphs of Schoenfeld residual

ggcoxzph(test.ph)[1]
$`1`

Model validation: Graphs of Schoenfeld residual

ggcoxzph(test.ph)[2]
$`2`

Model validation: Graphs of Schoenfeld residual

ggcoxzph(test.ph)[3]
$`3`

Model validation: Log-log plots

  • Plot the log-log Kaplan Meier survival estimates against time (or log of time).

  • Evaluate whether the curves are parallel and do not cross.

  • Only useful for categorical variables.

km2 <- survfit(Surv(time, status) ~ sex, data = lung)
ggsurvplot(km2, fun = "cloglog")

Non-proportional hazards

  • If the assumption of proportional hazards is violated:
m3 <- coxph(Surv(time, status) ~ sex + meal.cal, data = lung)
test.ph <- cox.zph(m3)
test.ph
         chisq df     p
sex       1.45  1 0.229
meal.cal  4.76  1 0.029
GLOBAL    6.49  2 0.039

Non-proportional hazards

ggcoxzph(test.ph)[2]
$`2`
  • Add interaction of the covariate with time.

  • Stratify for the variable that violates the proportional hazards assumption.

Take home messages

Take home messages

  • Time-to-event variables cannot be analysed using standard methods.

  • Kaplan-Meier is the most common method to estimate survival curves.

  • Survival curves can be compared using the log-rank test.

  • Cox models are the regression models used to study the association between survival and one or more predictor variables.


📅 Thanks & See You Next Time! 👋